
    ASR decoding in a computational model of human word recognition

    This paper investigates the interaction between acoustic scores and symbolic mismatch penalties in multi-pass speech decoding techniques that are based on the creation of a segment graph followed by a lexical search. The interaction between acoustic and symbolic mismatches determines to a large extent the structure of the search space of these multi-pass approaches. The background of this study is a recently developed computational model of human word recognition, called SpeM. SpeM is able to simulate human word recognition data and is built as a multi-pass speech decoder. Here, we focus on unravelling the structure of the search space that is used in SpeM and similar decoding strategies. Finally, we elaborate on the close relation between distances in this search space and distance measures in search spaces that are based on a combination of acoustic and phonetic features.
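
    A minimal sketch, on a toy segment graph, of how acoustic scores and symbolic mismatch penalties can be combined into a single path cost during a lexical search. The function, the graph and the weighting scheme below are illustrative assumptions, not SpeM's actual implementation.

```python
# Dijkstra-style search over a segment graph in which every edge carries an
# acoustic cost (e.g. a negative log likelihood) and a symbolic mismatch
# penalty (e.g. the cost of a phone substitution, insertion or deletion
# against the lexicon). Their relative weighting shapes the search space.
import heapq

def best_path_cost(segment_graph, start, goal, mismatch_weight=1.0):
    frontier = [(0.0, start)]
    best = {start: 0.0}
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue
        for nxt, acoustic_cost, mismatch_penalty in segment_graph.get(node, []):
            total = cost + acoustic_cost + mismatch_weight * mismatch_penalty
            if total < best.get(nxt, float("inf")):
                best[nxt] = total
                heapq.heappush(frontier, (total, nxt))
    return float("inf")

# Toy graph: node -> list of (next_node, acoustic_cost, mismatch_penalty)
graph = {"s": [("a", 1.2, 0.0), ("b", 0.8, 1.0)],
         "a": [("g", 0.5, 0.0)],
         "b": [("g", 0.4, 0.5)]}
print(best_path_cost(graph, "s", "g"))  # 1.7
```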

    Identification of intervocalic consonants in stationary and nonstationary noise

    The factors which underlie the perception of consonants in noise remain poorly understood. In this study, native listeners identified 24 English consonants spoken by 8 talkers presented in 9 intervocalic contexts with varying stress position. Listeners were tested in 5 noise conditions: tokens were masked by stationary speech-shaped noise, a competing talker, 3- and 8-speaker babble, and speech-modulated noise, all of which have the long-term spectrum of speech. The rank ordering of consonant identification scores in stationary noise was highly correlated (r=0.9, p<0.0001) with a similar condition reported by Phatak and Allen [JASA 121: 2312-2326, 2007], but less so in the 4 nonstationary noise backgrounds (r=0.74, p<0.001). In particular, /y/, /r/, /l/, /f/, /ch/, /sh/, /m/ and most of the plosives showed a wide variation in ranking. These findings suggest that, in addition to the long-term spectrum of the masker, consonant identification in noise is affected by other factors such as temporal fluctuations in the masker, misallocation of foreground/background components, and attention.
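
    A small illustration, with made-up numbers, of the kind of rank-order comparison reported above: Spearman's rank correlation is one standard way to quantify how similar two consonant rankings are. The scores below are hypothetical, not data from the study.

```python
# Correlate consonant identification scores from two conditions (or two
# studies) by rank order; scipy.stats.spearmanr returns the rank correlation
# coefficient and its p-value. The scores here are invented for illustration.
from scipy.stats import spearmanr

consonants = ["p", "t", "k", "b", "d", "g", "m", "n", "f", "s"]
scores_stationary = [0.62, 0.70, 0.68, 0.55, 0.58, 0.52, 0.80, 0.77, 0.45, 0.83]
scores_reference  = [0.60, 0.72, 0.65, 0.50, 0.61, 0.55, 0.82, 0.75, 0.48, 0.85]

rho, p = spearmanr(scores_stationary, scores_reference)
print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")
```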

    Articulatory feature classification using convolutional neural networks

    The ultimate goal of our research is to improve an existing speech-based computational model of human speech recognition on the task of simulating the role of fine-grained phonetic information in human speech processing. As part of this work we are investigating articulatory feature classifiers that are able to create reliable and accurate transcriptions of the articulatory behaviour encoded in the acoustic speech signal. Articulatory feature (AF) modelling of speech has received a considerable amount of attention in automatic speech recognition research. Different approaches have been used to build AF classifiers, most notably multi-layer perceptrons. Recently, deep neural networks have been applied to the task of AF classification. This paper aims to improve AF classification by investigating two different approaches: 1) the usefulness of a deep convolutional neural network (CNN) for AF classification; 2) the integration of the Mel filtering operation into the CNN architecture. The results showed a remarkable improvement in classification accuracy of the CNNs over state-of-the-art AF classification results for Dutch, most notably in the minority classes. Integrating the Mel filtering operation into the CNN architecture did not further improve classification performance.
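
    A minimal PyTorch sketch of the two settings contrasted above: a small CNN over precomputed Mel features versus the same CNN with the Mel filtering folded in as a first linear layer over power spectra. The layer sizes, the learnable-filterbank choice and all names are assumptions for illustration, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class AFClassifierCNN(nn.Module):
    def __init__(self, n_freq_bins=257, n_mels=40, n_af_classes=5,
                 integrate_mel=False):
        super().__init__()
        self.integrate_mel = integrate_mel
        if integrate_mel:
            # Mel filterbank realised as a (here learnable) linear map
            # over the frequency axis of a power spectrogram
            self.mel = nn.Linear(n_freq_bins, n_mels, bias=False)
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Linear(32 * 4 * 4, n_af_classes)

    def forward(self, x):
        # x: (batch, time, n_freq_bins) power spectra when integrate_mel=True,
        # otherwise (batch, time, n_mels) precomputed Mel features
        if self.integrate_mel:
            x = self.mel(x)
        x = x.unsqueeze(1)            # -> (batch, 1, time, mel_bins)
        x = self.conv(x)
        return self.head(x.flatten(1))

model = AFClassifierCNN(integrate_mel=True)
frames = torch.randn(8, 100, 257)     # 8 chunks of 100 spectral frames
print(model(frames).shape)            # torch.Size([8, 5])
```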

    The role of articulatory feature representation quality in a computational model of human spoken-word recognition

    Fine-Tracker is a speech-based model of human speech recognition. While previous work has shown that Fine-Tracker is successful at modelling aspects of human spoken-word recognition, its speech recognition performance is not comparable to human performance, possibly due to suboptimal intermediate articulatory feature (AF) representations. This study investigates the effect of improved AF representations, obtained using a state-of-the-art deep convolutional network, on Fine-Tracker's simulation and recognition performance. Although the improved AF quality resulted in better speech recognition, it, surprisingly, did not lead to an improvement in Fine-Tracker's simulation power.

    Perceptual learning of liquids in older listeners

    Numerous studies have shown that young listeners can adapt to idiosyncratic pronunciations through lexically-guided perceptual learning (McQueen et al., 2006; Norris et al., 2003). Aging may affect sensitivity to the higher frequencies in the speech signal, which results in the loss of sensitivity to phonetic detail. Nevertheless, short-term adaptation to accents and to time-compressed speech seems to be preserved with aging and with hearing loss (Adank & Janse, 2010; Gordon-Salant et al., 2010). However, the extent of the flexibility of phoneme categories and the conditions under which these phoneme boundary shifts can or cannot occur in an older population have not been investigated yet. This research investigates whether older listeners are able to tune into a speaker like young normal-hearing listeners can, by comparing the perceptual learning effect of older listeners (aged 60+, varying in their hearing sensitivity) and young (normal-hearing) listeners. Moreover, we investigate whether hearing loss affects the ability to learn non-standard phoneme pronunciations. Hearing loss may interfere with perceptual learning, as perceptual evidence in favour of a certain pronunciation variant is weaker. We therefore expected the perceptual learning effect of older listeners to be weaker and less stable than for young listeners. 36 young and 60 older listeners were exposed to an ambiguous [l/ɹ] in Dutch words ending in either /r/ or /l/ and to Dutch words ending in natural /r/ and /l/, in a lexical decision task (following Norris et al., 2003; Scharenborg et al., 2011). Young listeners gave significantly more correct answers to natural than to ambiguous stimuli (p<0.001). Older listeners had fewer correct answers to the natural stimuli (p<0.05), but showed relatively less impact of stimulus ambiguity (p<0.005). Young listeners gave significantly slower responses to ambiguous than to natural stimuli (p<0.001). Older listeners gave slower responses to the natural stimuli than the young listeners (p<0.05), but again were less impacted by stimulus ambiguity (p<0.005). In a subsequent phonetic categorisation task, listeners were confronted with a range of ambiguous sounds from the [l]-[ɹ]-continuum. The results revealed that listeners exposed to ambiguous [l/ɹ] in /r/-final words gave significantly more /r/-responses than listeners exposed to [l/ɹ] in /l/-final words (p<0.001; see also Figure 1). This effect was significantly stronger for the young listeners in block 1, but not so in the subsequent blocks. After dividing the older listener group into one better-hearing and one poorer-hearing group, no interaction was found between exposure condition and hearing status, suggesting that the age group difference in the size of the initial learning effect may not be due to hearing status. Contrary to our expectations, the learning effect for the older listeners remained stable over blocks, while the young listeners showed ‘unlearning’; i.e., the difference in percentage /r/-responses between the two exposure groups of young listeners grew significantly smaller over blocks. Concluding, the learning effect is stronger right after exposure for young listeners, while the effect is longer lasting for older listeners. Our results show that older listeners, with and without hearing loss, can still retune their phoneme categories to facilitate word recognition. Hearing loss does not seem to interfere with perceptual learning. 
Our results are in line with other evidence that the perceptual system remains flexible throughout the lifespan (Adank & Janse, 2010; Golomb et al., 2007; Peelle & Wingfield, 2005).

    Lexical embedding in spoken Dutch

    A stretch of speech is often consistent with multiple words, e.g., the sequence /hæm/ is consistent with ‘ham’ but also with the first syllable of ‘hamster’, resulting in temporary ambiguity. However, to what degree does this lexical embedding occur? Analyses of two corpora of spoken Dutch showed that 11.9%-19.5% of polysyllabic word tokens have word-initial embedding, while 4.1%-7.5% of monosyllabic word tokens can appear word-initially embedded. This is much lower than suggested by an analysis of a large dictionary of Dutch. Speech processing thus appears to be simpler than one might expect on the basis of dictionary statistics.
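
    A rough sketch of one of the two embedding counts reported above: how often a word token's phonemic form is a proper prefix of a longer word in the lexicon. The transcriptions below are toy strings, not drawn from the corpora analysed.

```python
# Count how often word tokens appear word-initially embedded, i.e. how often
# a token's phonemic form is a proper prefix of a longer word in the lexicon.
# The strings here merely stand in for Dutch phonemic transcriptions.
def embedding_rate(tokens, lexicon):
    embedded = sum(
        1 for tok in tokens
        if any(word.startswith(tok) and word != tok for word in lexicon)
    )
    return embedded / len(tokens)

lexicon = {"hAm", "hAmst@r", "bAl", "bAlk", "kAt"}   # word types (toy forms)
tokens = ["hAm", "bAl", "kAt", "hAmst@r", "bAl"]     # corpus word tokens
print(f"{embedding_rate(tokens, lexicon):.1%} of tokens are word-initially embedded")
```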

    Connected digit recognition with class specific word models

    This work focuses on efficient use of the training material by selecting the optimal set of model topologies. We do this by training multiple word models for each word class, based on a subclassification according to a priori knowledge of the training material. We examine classification criteria with respect to the duration of the word, the gender of the speaker, the position of the word in the utterance, pauses in the vicinity of the word, and combinations of these. Comparative experiments were carried out on a corpus consisting of Dutch spoken connected digit strings and isolated digits, recorded in a wide variety of acoustic conditions. The results show that classification based on the gender of the speaker, the position of the digit in the string, pauses in the vicinity of the training tokens, and models based on a combination of these criteria perform significantly better than the set with a single model per digit.
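
    A simple sketch of the subclassification idea described above: each training token of a digit is routed to a class-specific word model according to a priori criteria such as speaker gender and the digit's position in the string. The keys and metadata fields are illustrative assumptions; the study also uses duration and nearby pauses as criteria.

```python
# Map each training token to the class-specific word model it should train,
# based on a priori metadata. Combining criteria simply yields a finer key.
def model_key(token):
    position = ("initial" if token["index"] == 0
                else "final" if token["index"] == token["n_words"] - 1
                else "medial")
    return (token["digit"], token["gender"], position)

tokens = [
    {"digit": "drie", "gender": "f", "index": 0, "n_words": 4},
    {"digit": "drie", "gender": "m", "index": 3, "n_words": 4},
]
for t in tokens:
    print(model_key(t))   # ('drie', 'f', 'initial'), ('drie', 'm', 'final')
```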

    The differential roles of lexical and sublexical processing during spoken-word recognition in clear and in noise

    Successful spoken-word recognition relies on an interplay between lexical and sublexical processing. Previous research demonstrated that listeners readily shift between more lexically-biased and more sublexically-biased modes of processing in response to the situational context in which language comprehension takes place. Recognizing words in the presence of background noise reduces the perceptual evidence for the speech signal and, compared to listening in the clear, results in greater uncertainty. It has been proposed that, when dealing with greater uncertainty, listeners rely more strongly on sublexical processing. The present study tested this proposal using behavioral and electroencephalography (EEG) measures. We reasoned that such an adjustment would be reflected in changes in the effects of variables predicting recognition performance with loci at the lexical and sublexical levels, respectively. We presented native speakers of Dutch with words featuring substantial variability in (1) word frequency (locus at the lexical level), (2) phonological neighborhood density (loci at the lexical and sublexical levels), and (3) phonotactic probability (locus at the sublexical level). Each participant heard each word in noise (presented at one of three signal-to-noise ratios) and in the clear, and performed a two-stage lexical decision and transcription task while EEG was recorded. Using linear mixed-effects analyses, we observed behavioral evidence that listeners relied more strongly on sublexical processing when speech quality decreased. Mixed-effects modelling of the EEG signal in the clear condition showed that sublexical effects were reflected in early modulations of ERP components (e.g., within the first 300 ms post word onset). In noise, EEG effects occurred later and involved multiple regions activated in parallel. Taken together, we found evidence, especially in the behavioral data, supporting previous accounts that the presence of background noise induces a stronger reliance on sublexical processing.
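
    A hedged sketch of the kind of linear mixed-effects analysis mentioned above: a trial-level response measure modelled from word frequency, neighborhood density and phonotactic probability, with a random intercept per participant. The column names, the input file and the use of statsmodels are assumptions for illustration; the actual analyses may be specified differently.

```python
# Fit a linear mixed-effects model with fixed effects for the three
# lexical/sublexical predictors and listening condition, and a random
# intercept for participant. 'lexdec_trials.csv' is a hypothetical file
# with one row per trial.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("lexdec_trials.csv")

model = smf.mixedlm(
    "rt ~ log_frequency + neighborhood_density + phonotactic_prob + condition",
    data=df,
    groups=df["participant"],
)
result = model.fit()
print(result.summary())
```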